Use R building blocks in creative ways to tackle any task! The challenge is arranging them in creative and specific ways to accomplish meaningful research.
Several vocabulary/jargon terms will help you begin to understand how R works!
Functions are followed by parentheses (), such as in mean().
<- is the assignment operator. It assigns values on the right to objects on the left. So, after executing x <- 5, the value of x is 5.
Open a new R script by clicking “File” –> “New File” –> “R Script”. Save it and then copy/paste the code below.
Click a line of code and press the Ctrl and Enter keys on your keyboard simultaneously to run it.
Note that no sort of result is displayed on the screen during the assignment step. Type the name of the variable and run it (“call” the variable) to show the output in the console.
x <- 5
x
## [1] 5
Like everything else in R, each piece of data has a class (type) associated with
it, which determines how we can manipulate it. Use the
class() function to find out what it is. In class(x), the variable x
is the argument. We will talk about five basic types:
Numeric: decimals; the default class for all numbers in R
Character: text, always wrapped in quotations " "
(single or double are fine, be consistent)
Logical: TRUE and FALSE; stored internally as 1 and 0 and as such take on mathematical properties; useful for turning function parameters “on” or “off”, as well as for data subsetting (see below).
Integer: positive and negative whole numbers, including zero
Factor: categorical data
Numeric
class(x)
## [1] "numeric"
# use a hashtag within a code chunk to comment your code
# note the use of the underscore to represent a space in the variable name
my_name <- "Nerd Squirrel"
my_name
## [1] "Nerd Squirrel"
class(my_name)
## [1] "character"
class(TRUE)
## [1] "logical"
class(FALSE)
## [1] "logical"
TRUE + 1
## [1] 2
FALSE - 1
## [1] -1
# is 3 greater than 4?
3 > 4
## [1] FALSE
# is 4 less than or equal to 4?
4 <= 4
## [1] TRUE
# is "Apple" equal to "apple" (hint: R is case sensitive!)
"Apple" == "apple"
## [1] FALSE
# is "Apple" not equal to "Apple"
"Apple" != "Apple"
## [1] FALSE
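Because TRUE and FALSE behave like 1 and 0, summing a logical comparison counts how many elements satisfy it, and averaging it gives the proportion. A small sketch, using a made-up `scores` vector that is not part of the lesson data:

```r
# hypothetical example vector (not from the lesson data)
scores <- c(72, 85, 91, 64, 88)

# a comparison returns one TRUE/FALSE per element
passed <- scores >= 80
passed
## [1] FALSE  TRUE  TRUE FALSE  TRUE

# TRUE counts as 1, so sum() counts the elements that passed
sum(passed)
## [1] 3

# and mean() gives the proportion that passed
mean(passed)
## [1] 0.6
```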
We can use various “as dot” functions to convert data types. To convert numeric to integer class for example, we could type:
y <- as.integer(x)
y
## [1] 5
class(y)
## [1] "integer"
# or
y2 <- 8L
y2
## [1] 8
class(y2)
## [1] "integer"
Other “as dot” functions exist as well: as.character(),
as.numeric(), and as.factor() to name a
few:
school <- "Stanford University"
school
## [1] "Stanford University"
class(school)
## [1] "character"
# convert to factor
school_fac <- as.factor(school)
school_fac
## [1] Stanford University
## Levels: Stanford University
class(school_fac)
## [1] "factor"
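One conversion deserves extra care: a factor is stored internally as integer codes (one per level), so calling as.numeric() directly on a factor returns those codes rather than the values you see printed. A minimal sketch with a hypothetical `sizes` factor:

```r
# numbers stored as text convert cleanly
as.numeric("3.14")
## [1] 3.14

# hypothetical factor whose labels happen to look like numbers
sizes <- as.factor(c("10", "20", "10"))

as.numeric(sizes)                # returns the level codes, not the labels
## [1] 1 2 1

as.numeric(as.character(sizes))  # convert to character first to recover the values
## [1] 10 20 10
```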
It might seem difficult keeping track of variables you define, but
remember they are listed in RStudio’s “Environment” tab. You can also
type ls() to print them to the console.
ls()
## [1] "my_name" "school" "school_fac" "x" "y"
## [6] "y2"
You can also use dir() to view the contents of your
working directory (i.e., the folder on your computer that RStudio
loads from and saves to by default). Type getwd() to print its location.
Remove a single variable with rm()
rm(my_name)
my_name # Error
Remember to use autocomplete when typing a function or variable name, since it is easy for humans to make typing errors.
Alternatively, you can wipe your Environment clean by clicking the yellow broom icon on the Environment tab or by typing
rm(list = ls())
If your console gets too messy, press Ctrl + L to
clear it and return the prompt to the top, which makes it more readable. This also makes
scrolling through your output much easier!
To completely restart your R session, click “Session” –> “Restart R” from the top toolbar menu.
If saving one piece of data in a variable is good, saving many is
better. Use the c() function to combine multiple pieces of
data into a vector, which is an ordered group of the
same type of data.
We can nest the c() function inside of “as dot”
functions to create vectors of different types.
# example numeric (default) vector
traffic_stops <- c(8814, 9915, 9829, 10161, 6810, 8991)
# Factor and integer vectors
city <- as.factor(c("SF", "DC", "DC", "DC", "SF", "SF"))
year <- as.integer(c(2000, 2000, 2001, 2002, 2001, 2002))
# Call these variables to print them to the screen and check their class
traffic_stops
## [1] 8814 9915 9829 10161 6810 8991
city
## [1] SF DC DC DC SF SF
## Levels: DC SF
year
## [1] 2000 2000 2001 2002 2001 2002
class(traffic_stops)
## [1] "numeric"
class(city)
## [1] "factor"
class(year)
## [1] "integer"
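Since a vector can hold only one type of data, c() quietly converts mixed inputs to the most flexible type present (logical, then numeric, then character). A short sketch of that coercion; the `mixed` variable is just an illustration:

```r
# a logical combined with a number is promoted to numeric
class(c(1, TRUE))
## [1] "numeric"

# add a single piece of text and everything becomes character
mixed <- c(1, TRUE, "a")
mixed
## [1] "1"    "TRUE" "a"
class(mixed)
## [1] "character"
```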
A data frame is an ordered group of equal-length vectors.
More simply put, a data frame is a tabular data structure organized into horizontal rows and vertical columns, i.e. a spreadsheet! These are often stored as comma separated values (.csv) files: plain text files in which commas delineate column breaks, and which open cleanly in spreadsheet programs like Microsoft Excel.
We can assemble our three vectors from above into a data frame with
the data.frame() function.
police <- data.frame(city, traffic_stops, year)
police
class(police)
## [1] "data.frame"
# display the compact structure of a data frame
str(police)
## 'data.frame': 6 obs. of 3 variables:
## $ city : Factor w/ 2 levels "DC","SF": 2 1 1 1 2 2
## $ traffic_stops: num 8814 9915 9829 10161 6810 ...
## $ year : int 2000 2000 2001 2002 2001 2002
# class = data.frame
# 6 observations (rows)
# 3 variables (columns, or vectors)
# column names are preceded by the dollar sign
Open a new script and create a data frame that contains 6 rows and 3 columns by following the instructions above.
Advanced: what is the difference between a data frame and a tidyverse tibble?
A vector can be indexed (positionally referenced) by typing its name
followed by its index within square brackets []. For
example, if we want to index just the first thing in the “city” vector,
we could type
city[1]
## [1] SF
## Levels: DC SF
# or
police$city[1]
## [1] SF
## Levels: DC SF
If we want to return just the third element in traffic_stops, we would type
traffic_stops[3]
## [1] 9829
# or
police$traffic_stops[3]
## [1] 9829
Note that R is a “one-indexed” programming language. This means that counting anything starts at 1.
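Square brackets accept more than a single position. The sketch below, which re-creates the traffic_stops vector so it runs on its own, shows a few common patterns: several positions at once, a range, a negative index to drop an element, and length() to reach the last element.

```r
traffic_stops <- c(8814, 9915, 9829, 10161, 6810, 8991)

traffic_stops[c(1, 3)]                # first and third elements
## [1] 8814 9829

traffic_stops[2:4]                    # a consecutive range
## [1]  9915  9829 10161

traffic_stops[-1]                     # everything except the first element
## [1]  9915  9829 10161  6810  8991

traffic_stops[length(traffic_stops)]  # the last element
## [1] 8991
```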
$ single column subsetting
Note that columns are preceded by the dollar sign $. You
can access a single column by typing the name of your data frame, the
$, and then the column name. Note that autocomplete works
for much more than just function and variable names!
# show just the column containing the number of traffic stops
police$traffic_stops
## [1] 8814 9915 9829 10161 6810 8991
# ... which can then easily be plugged into another function
hist(police$traffic_stops)
[,] bracket notation subsetting
This can also be extended to rows and columns using bracket notation [,].
Type the name of your data frame followed by square brackets with a comma between them.
Here, we can enter two indices: one for the rows (before the comma)
and one for the columns (after the comma) like this:
[rows, cols]
For example, if we want two columns, we cannot use the dollar sign operator (since it only works for single columns), but we could type either the indices or the column names as a vector!
If either the row or column position is left blank, all rows/columns will be returned because no subset was specified.
To subset the police data frame to just the city name and number
of stops columns, type
city_and_stops <- police[,c(1,2)]
city_and_stops
# or, for consecutive sequences
city_and_stops <- police[,1:2]
city_and_stops
# or using variable names
city_and_stops <- police[,c("city", "traffic_stops")]
city_and_stops
Keep in mind that redefining a variable will overwrite it each time, as we are doing here.
We can do the same thing for rows by adding a vector of the row indices to include. For example, to keep just rows 1, 2, and 3 along with columns “city” and “traffic_stops” we could type:
subset1 <- police[1:3, c("city", "traffic_stops")]
subset1
Or, to keep rows 1, 2, and 4 along with “city” and “traffic_stops” columns:
subset2 <- police[c(1,2,4), c("city", "traffic_stops")]
subset2
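One detail of bracket notation can be surprising: selecting a single column with [ , ] simplifies the result to a plain vector unless you add drop = FALSE. The sketch below rebuilds the police data frame so it runs on its own; the names first_row, stops_vec, and stops_df are illustrative.

```r
police <- data.frame(
  city = as.factor(c("SF", "DC", "DC", "DC", "SF", "SF")),
  traffic_stops = c(8814, 9915, 9829, 10161, 6810, 8991),
  year = as.integer(c(2000, 2000, 2001, 2002, 2001, 2002))
)

first_row <- police[1, ]               # row 1, column position blank = all columns
stops_vec <- police[, 2]               # one column: simplified to a vector
stops_df  <- police[, 2, drop = FALSE] # one column: kept as a data frame

class(first_row)
## [1] "data.frame"
class(stops_vec)
## [1] "numeric"
class(stops_df)
## [1] "data.frame"
```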
Subset by logical condition by using the logical operators discussed
above: ==, >, <=, etc.
For example, if you want to subset only rows with stops less than 9000 you would combine the dollar sign operator along with bracket notation.
This performs a row subsetting operation based on the condition of a column. Note that the column position is left blank after the comma to indicate all columns should be returned.
low_stops <- police[police$traffic_stops < 9000, ]
low_stops
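It can help to look at the condition on its own. The expression inside the brackets is just a logical vector with one TRUE or FALSE per row, and R keeps the rows marked TRUE. A sketch that rebuilds police so it runs on its own (the `keep` name is illustrative):

```r
police <- data.frame(
  city = as.factor(c("SF", "DC", "DC", "DC", "SF", "SF")),
  traffic_stops = c(8814, 9915, 9829, 10161, 6810, 8991),
  year = as.integer(c(2000, 2000, 2001, 2002, 2001, 2002))
)

# the condition by itself: one logical value per row
keep <- police$traffic_stops < 9000
keep
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

# which() reports the row numbers where the condition is TRUE
which(keep)
## [1] 1 5 6

# using the logical vector in the row position keeps only those rows
low_stops <- police[keep, ]
nrow(low_stops)
## [1] 3
```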
Or, to include multiple conditions use logical and &
(all conditions must be satisfied) and logical or | (just
one condition must be satisfied).
To subset just rows that contain SF as the city and stops less than 7000, type
sf_low_stops <- police[police$city == "SF" & police$traffic_stops < 7000, ]
sf_low_stops
Open a new script and create a subset that contains data from DC or stops less than or equal to 7000 and just columns “city” and “traffic_stops”.
Advanced: use the filter() and select()
functions from the dplyr R package to do the same thing
(hint: see the dplyr section below!)
Lists are R objects that can contain heterogeneous types (remember that a vector can only contain data of the same type). For example:
my_list <- list("Grapefruit", TRUE, "Nerd Squirrel", c(3.14, 2.13, 1.45))
my_list
## [[1]]
## [1] "Grapefruit"
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] "Nerd Squirrel"
##
## [[4]]
## [1] 3.14 2.13 1.45
You can also name the list elements:
names(my_list) <- c("food", "logical value", "name", "vector")
my_list
## $food
## [1] "Grapefruit"
##
## $`logical value`
## [1] TRUE
##
## $name
## [1] "Nerd Squirrel"
##
## $vector
## [1] 3.14 2.13 1.45
Remember the dollar sign operator $ we used above to
extract just a single column from a data frame? This actually comes from
lists! We can index only the “name” field from our list like:
my_list$name
## [1] "Nerd Squirrel"
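Lists can also be indexed with brackets. Double brackets [[ ]] pull out the element itself (like $), while single brackets [ ] return a smaller list. A brief sketch that rebuilds the named list so it runs on its own:

```r
my_list <- list(
  food = "Grapefruit",
  `logical value` = TRUE,
  name = "Nerd Squirrel",
  vector = c(3.14, 2.13, 1.45)
)

my_list[["name"]]       # the element itself, same as my_list$name
## [1] "Nerd Squirrel"

class(my_list["name"])  # single brackets return a list of length one
## [1] "list"

my_list[["vector"]][2]  # chain indices to reach inside an element
## [1] 2.13

my_list$`logical value` # names with spaces need backticks
## [1] TRUE
```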
A matrix is similar to a data frame but can only contain data of a single type. Matrices are often used in mathematical calculations. Like a data frame, a matrix has two dimensions (rows and columns); its generalization, the array, can have n dimensions.
ex_matrix <- matrix(4:9, ncol = 2)
ex_matrix
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
class(ex_matrix)
## [1] "matrix" "array"
class(ex_matrix[1])
## [1] "integer"
# What happened to our numbers in the presence of a single character data element?
ex_matrix2 <- matrix(c(4, 5, 6, 7, 8, 9, 10, 11, "pizza"), ncol = 3)
ex_matrix2
## [,1] [,2] [,3]
## [1,] "4" "7" "10"
## [2,] "5" "8" "11"
## [3,] "6" "9" "pizza"
class(ex_matrix2[1])
## [1] "character"
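Matrices use the same [rows, cols] bracket notation as data frames, and base R's array() function extends the idea beyond two dimensions. A minimal sketch (ex_array is an illustrative name):

```r
ex_matrix <- matrix(4:9, ncol = 2)

ex_matrix[2, 2]   # row 2, column 2
## [1] 8
ex_matrix[, 1]    # all rows of column 1
## [1] 4 5 6
dim(ex_matrix)    # rows, then columns
## [1] 3 2

# an array with three dimensions: 2 rows, 3 columns, 4 "layers"
ex_array <- array(1:24, dim = c(2, 3, 4))
dim(ex_array)
## [1] 2 3 4
ex_array[1, 2, 3] # row 1, column 2, layer 3
## [1] 15
```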
To call R’s help pages and see how a function is used, in your console type a question mark before a function name.
Read the Description section to learn what the function does. See the Usage section to see which parameters belong to the function. Check out the Arguments section to see the rules for each argument! Often included (but not always) below these sections are Details that offer more information, Value that describes a function’s output, Notes, Authors, References and copy/paste Examples to experiment with.
?data.frame # help with the data.frame() function
?mean # arithmetic mean
?hist # histogram
?lm # look at help pages for linear regression
?glm # generalized linear models
?">" # Wrap symbols in quotations to view their help files
?"&"
Other helpful debugging tools/strategies:
1. Googling the error text, and referring to a forum like StackOverflow. You can also prompt a search engine with a “how-to” question, such as “R ggplot2 how to make scatterplot”.
2. (IDE-dependent) Placing breakpoints in your program and using the debugger tool to step through the program.
3. Strategically place print() statements to know where your program is reaching/failing to reach.
4. Ask a friend! A fresh set of eyes goes a long way when you’re working on code.
5. Restart your IDE and/or your machine.
6. Schedule an SSDS consultation.
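As a sketch of the print-statement strategy, the hypothetical function below reports its progress at each step, so you can see how far it got before anything goes wrong. It uses message(), a close cousin of print() that writes to the console without the [1] prefix.

```r
# hypothetical helper, written only to illustrate tracing
mean_of_positive <- function(values) {
  message("Received ", length(values), " values")
  positive <- values[values > 0]
  message("Kept ", length(positive), " positive values")
  mean(positive)
}

mean_of_positive(c(-2, 4, 6))
## Received 3 values
## Kept 2 positive values
## [1] 5
```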
Subsetting with the dollar sign operator $ and bracket
notation [ , ] is incredibly useful, but you will
inevitably encounter major roadblocks when performing more complex
operations. Thankfully, the tidyverse was created in part to
alleviate the many frustrations of using base R data wrangling
methods.
Although your base R installation comes with many helpful features, many R users have written packages (i.e., additional software add-ons to R) that include shortcuts to helpful functions we would have to otherwise write ourselves. This is time saving since R packages exist for complicated tasks in virtually every field of study.
Package installation happens in two steps:
1. Use the install.packages() function to physically download the files to your computer.
2. However, your current RStudio session does not know that these files exist. Use the library() function to link the downloaded files to your current session.
For example:
# Step 1. Physically download the dplyr and tidyr files
# install.packages("dplyr")
# install.packages("tidyr")
# Step 2. Link these files to your current session
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
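Since step 1 only needs to happen once per computer, it can be handy to check which packages are already installed before downloading anything. A minimal sketch using base R's requireNamespace(); the variable names are illustrative, and the install line is left commented out so nothing is downloaded by accident:

```r
pkgs <- c("dplyr", "tidyr")

# TRUE for each package that is already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to download them
}
```

The “masked” messages printed by library(dplyr) mean that two packages define a function with the same name. You can always say exactly which one you want with the package::function() notation, e.g. dplyr::filter() versus stats::filter().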
Several dplyr functions will help us quickly subset data:
1. filter() - filter rows based on a condition
2. select() - select certain columns
3. group_by() - perform grouping operations
4. summarize() - create a new dataframe that summarizes each group to a single row (e.g., group means and standard deviations)
5. mutate() - create, modify, and delete columns
However, our operations will be based on pipes %>%,
that tell R to take the thing on the left of the pipe and insert it into
the thing on the right.
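A tiny sketch of the idea, assuming dplyr is installed (it makes %>% available): the piped version and the nested version below do exactly the same thing, but the pipe reads left to right.

```r
library(dplyr)  # makes the %>% pipe available

# nested calls: read from the inside out
sum(sqrt(c(1, 4, 9)))
## [1] 6

# piped calls: read left to right, same result
c(1, 4, 9) %>% sqrt() %>% sum()
## [1] 6
```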
How does it work? Start with the name of the data frame, add a pipe, then call the filter() function and insert a
condition in the parentheses.
filter()
For example, to filter only rows in police with stops
greater than 10000:
high_stops <- police %>%
filter(traffic_stops > 10000)
high_stops
select()
Or, to filter only rows in police with stops greater
than 10000 and select only the “traffic_stops” and “city” columns:
high_stops2 <- police %>%
filter(traffic_stops > 10000) %>%
select(city, traffic_stops)
high_stops2
We can add multiple conditions such as logical and & and or |
* Logical and & means that all conditions must be satisfied to be included in the subset.
* Logical or | means that just one of multiple conditions needs to be satisfied to be included in the subset.
What if we want only rows from police returned that had
fewer than 9000 stops AND occurred in the year 2002?
sub1 <- police %>%
filter(traffic_stops < 9000 & year == 2002)
sub1
How about if we want only rows from police returned that
had fewer than 9000 stops OR occurred in the year 2002? Do you think we will
get more or fewer rows returned? Why?
sub2 <- police %>%
filter(traffic_stops < 9000 | year == 2002)
sub2
group_by() and summarize()
group_by() and mutate()
While summarize() will create a new tibble,
mutate() creates new column(s) in an existing tibble.
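A minimal sketch of that difference, assuming dplyr is installed. It rebuilds the police data frame so it runs on its own and groups by city; the grouping column and the new column names (mean_stops, city_mean_stops) are illustrative choices.

```r
library(dplyr)

police <- data.frame(
  city = as.factor(c("SF", "DC", "DC", "DC", "SF", "SF")),
  traffic_stops = c(8814, 9915, 9829, 10161, 6810, 8991),
  year = as.integer(c(2000, 2000, 2001, 2002, 2001, 2002))
)

# summarize(): collapses each group to a single row
city_means <- police %>%
  group_by(city) %>%
  summarize(mean_stops = mean(traffic_stops))
city_means  # 2 rows: DC (about 9968.3) and SF (8205)

# mutate(): keeps all 6 rows and adds each row's group mean as a new column
police_with_means <- police %>%
  group_by(city) %>%
  mutate(city_mean_stops = mean(traffic_stops)) %>%
  ungroup()
police_with_means
```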
Open a new script and use the help file for the mutate()
function to write code that, combined with group_by(), creates a new
column that computes the average stops per year for the
police data frame. (hint: assume there are 365 days in a
year!)
Importing data can be fairly straightforward thanks to R’s syntax and the clickable buttons inside RStudio. Import the “penguins.csv” file using the code below. A .csv file (“comma-separated values”) is a text file in which each line is a row and commas mark the breaks between columns, so it is easily converted to a data frame!
penguins <- read.csv("data/raw/penguins.csv")
head(penguins) # show first 6 rows by default
str(penguins) # show compact structure: class, nrow, ncol, column names and types, and examples of the data
## 'data.frame': 344 obs. of 7 variables:
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : chr "MALE" "FEMALE" "FEMALE" "" ...
nrow(penguins) # number of rows
## [1] 344
ncol(penguins) # number of columns
## [1] 7
dim(penguins) # number of rows by columns
## [1] 344 7
sum(is.na(penguins)) # total number of missing values
## [1] 8
sum(is.na(penguins)) / (nrow(penguins) * ncol(penguins)) # proportion of missing data for the dataset (multiply by 100 for %)
## [1] 0.003322259
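Two follow-up points are worth a quick sketch: is.na() can be tallied per column with colSums(), and an empty string "" (like the one visible in the sex column above) is not counted as missing unless you recode it. The `toy` data frame below is a made-up stand-in so the example runs without the penguins file.

```r
# made-up stand-in for a few rows of penguins
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  body_mass_g = c(3750L, 3800L, NA),
  sex = c("MALE", "", "FEMALE")
)

# missing values per column: bill_length_mm 1, body_mass_g 1, sex 0
colSums(is.na(toy))

# a single NA makes the mean NA unless you remove it
mean(toy$bill_length_mm)
## [1] NA
mean(toy$bill_length_mm, na.rm = TRUE)
## [1] 39.7

# recode empty strings as missing so they are counted too
toy$sex[toy$sex == ""] <- NA
sum(is.na(toy))
## [1] 3
```

When importing, read.csv() can do this recoding for you via its na.strings argument, e.g. na.strings = c("", "NA").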
Alternatively you can click the “Import Dataset” button as in the screenshot below!
However, data are not always nicely formatted for exploration and analysis. This means that before we clean and subset data, we actually have to reshape it first. We often want the format to be tidy, which makes the data ready for summarization, visualization, and analysis. Check out the Carpentries lesson on data formats to learn more.
Signs of messy datasets:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Let’s take a look at one example using dplyr and
tidyr.
library(dplyr); library(ggplot2); library(tidyr)
gap_wide <- read.csv("data/raw/gapminder_wide.csv")
head(gap_wide)
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country) %>%
separate(obstype_year,
into = c('obs_type','year'),
sep = "_",
convert = TRUE)
str(gap_long)
## 'data.frame': 5112 obs. of 5 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obs_type : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ obs_values: num 2449 3521 1063 851 543 ...
head(gap_long)
tail(gap_long)
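The gather() and separate() steps above still work, but the tidyr documentation recommends pivot_longer() for new code, and it can do both steps in one call. A sketch under stated assumptions: `toy_wide` is a made-up, much smaller stand-in for gap_wide (the values are invented), and dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# made-up stand-in for gap_wide: one row per country, one column per variable_year
toy_wide <- data.frame(
  continent = c("Asia", "Oceania"),
  country = c("Japan", "Australia"),
  gdpPercap_1952 = c(3000, 10000),
  gdpPercap_1957 = c(4000, 11000),
  pop_1952 = c(86, 8.7),
  pop_1957 = c(91, 9.7)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -c(continent, country),              # reshape everything except these
    names_to = c("obs_type", "year"),           # split each column name into two
    names_sep = "_",                            # ... at the underscore
    values_to = "obs_values",                   # cell values go here
    names_transform = list(year = as.integer)   # store year as an integer
  )

dim(toy_long)  # 8 rows (2 countries x 4 reshaped columns), 5 columns
## [1] 8 5
```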
# Plot Japan's gdpPercap by year
japan <- gap_long %>%
filter(obs_type == "gdpPercap",
country == "Japan")
ggplot(japan, aes(x = year, y = obs_values)) +
geom_line() +
theme_minimal()
# Or, multiple countries in a group
gdp_oceania <- gap_long %>%
filter(obs_type == "gdpPercap",
continent == "Oceania")
ggplot(gdp_oceania, aes(x = year,
y = obs_values,
color = country)) +
geom_line() + theme_minimal()
R allows you to easily manipulate and subset strings in a variety of ways. Instead of getting lost in the weeds, below is a high level overview for importing, preprocessing, and analyzing text.
dir_of_texts <- file.path("data/raw/novels/")
head(dir_of_texts)
## [1] "data/raw/novels/"
dir(dir_of_texts)
## [1] "dracula.txt" "frankenstein.txt"
# install.packages("readtext")
texts_df <- readtext::readtext(dir_of_texts)
texts_df
# install.packages("tm")
corpus <- tm::Corpus(tm::VectorSource(texts_df))
corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
glimpse(corpus)
## Classes 'SimpleCorpus', 'Corpus' hidden list of 3
## $ content: Named chr [1:2] "The Project Gutenberg eBook of Dracula, by Bram Stoker\n\nThis eBook is for the use of anyone anywhere in the U"| __truncated__ "The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley\n\nThis eBook is for the u"| __truncated__
## ..- attr(*, "names")= chr [1:2] "dracula.txt" "frankenstein.txt"
## $ meta :List of 1
## ..$ language: chr "en"
## ..- attr(*, "class")= chr "CorpusMeta"
## $ dmeta :'data.frame': 2 obs. of 0 variables
# install.packages("stm")
clean_corpus <- stm::textProcessor(documents=texts_df$text,
# metadata = texts_df$doc_id,
lowercase = TRUE, #*
removestopwords = TRUE, #*
removenumbers = TRUE, #*
removepunctuation = TRUE, #*
stem = TRUE, #*
wordLengths = c(3,Inf), #*
sparselevel = 1, #*
language = "en", #*
verbose = TRUE, #*
onlycharacter = TRUE, # not def
striphtml = FALSE, #*
customstopwords = NULL, #*
v1 = FALSE) #*
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
clean_corpus
## A text corpus with 2 documents, and an 7709 word dictionary.
st_model <- stm::stm(documents = clean_corpus$documents,
vocab = clean_corpus$vocab,
K = 2,
max.em.its = 75,
data = clean_corpus,
init.type = "Spectral")
## Warning in stm::stm(documents = clean_corpus$documents, vocab =
## clean_corpus$vocab, : K=2 is equivalent to a unidimensional scaling model which
## you may prefer.
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## ..
## Recovering initialization...
## .............................................................................
## Initialization complete.
## ..
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.408)
## ..
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.392, relative change = 2.069e-03)
## ..
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.392, relative change = 8.495e-05)
## ..
## Completed E-Step (0 seconds).
## Completed M-Step.
## Model Converged
st_model
## A topic model with 2 topics, 2 documents and a 7709 word dictionary.
Who is Hawkin and where is Piccadilli? What about Elizabeth and Geneva?
stm::labelTopics(st_model)
## Topic 1 Top Words:
## Highest Prob: said, one, come, will, look, must, time
## FREX: carfax, movement, stern, varna, hawkin, lit, piccadilli
## Lift: aback, abat, abnorm, aboot, abreast, acrid, afar
## Score: hels, van, luci, mina, though, jonathan, count
## Topic 2 Top Words:
## Highest Prob: one, will, feel, now, yet, man, father
## FREX: elizabeth, clerval, justin, felix, perceiv, innoc, geneva
## Lift: accent, acquit, adam, affirm, agreeabl, albertus, alleg
## Score: elizabeth, clerval, justin, felix, perceiv, innoc, geneva
Call ?stm::labelTopics to see what Highest Prob, FREX,
Lift, and Score mean.
Requirements for ggplot2:
1. Layers to control different aspects of the plot (coordinates, shapes and colors mapped from data)
2. The + symbol to connect different layers
3. Data to be mapped
4. Geoms to tell R how the data should be represented (points, bars, lines, etc)
5. Themes for customizing scales, axes, labels, and much more
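Those pieces can be seen together in a minimal sketch, assuming ggplot2 is installed. The `toy` data frame and its values are made up for illustration.

```r
library(ggplot2)

# made-up data to be mapped
toy <- data.frame(
  hours = c(1, 2, 3, 4),
  score = c(2, 4, 3, 5)
)

p <- ggplot(data = toy, aes(x = hours, y = score)) +  # data and aesthetic mapping
  geom_point() +                                      # geom: draw points
  labs(x = "Hours studied", y = "Score") +            # axis labels
  theme_minimal()                                     # theme

p  # printing the object draws the plot
```

Each + adds one layer, so you can build a plot up gradually and rerun it after every addition.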
Histograms can be used to visualize variable distributions. The y-axis tells us how many observations have a value within a particular range, which is specified on the x-axis.
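To see the numbers behind a histogram, you can ask hist() to return its bins instead of drawing them. In this sketch (with a made-up vector `x`), the breaks are the ranges laid out along the x-axis and the counts are the bar heights on the y-axis.

```r
# made-up values
x <- c(1, 2, 2, 3, 3, 3, 8)

# plot = FALSE returns the binning without drawing anything
h <- hist(x, breaks = c(0, 2, 4, 6, 8), plot = FALSE)

h$breaks  # bin edges: 0-2, 2-4, 4-6, 6-8
## [1] 0 2 4 6 8
h$counts  # number of observations in each bin
## [1] 3 3 0 1
```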
hist()
hist(penguins$body_mass_g)
geom_histogram()
library(ggplot2)
ggplot(data = penguins, aes(x = body_mass_g)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
# note the message and warning above (they are not errors)! Try adjusting bins =
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(bins = 5)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
Boxplots are also used to visualize variable distributions, but with more information attached.
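The extra information in a boxplot is essentially a five-number summary plus any outliers. A sketch with a made-up vector `x`, using base R's fivenum() and boxplot.stats() to show the numbers a boxplot would draw:

```r
# made-up values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# minimum, lower hinge, median, upper hinge, maximum
fivenum(x)
## [1]   1   3   5   7 100

b <- boxplot.stats(x)
b$stats  # whisker ends, box edges, and median as drawn by boxplot()
## [1] 1 3 5 7 8
b$out    # points beyond the whiskers, drawn individually
## [1] 100
```

The upper whisker stops at 8 rather than 100 because, by default, whiskers extend at most 1.5 times the box height beyond the box.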
boxplot()
boxplot(penguins$body_mass_g)
# or by a grouping variable
boxplot(penguins$body_mass_g ~ penguins$species)
geom_boxplot()
ggplot(penguins, aes(x = species,
y = body_mass_g)) +
geom_boxplot() +
theme_bw()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
geom_col()
# create table of means with dplyr::summarize
library(dplyr)
(bmg_means <- penguins %>%
group_by(species) %>%
summarize(mean_bmg = mean(body_mass_g, na.rm = TRUE)))
# plot
ggplot(bmg_means, aes(x = species, y = mean_bmg)) +
geom_col() +
theme_classic() +
scale_y_continuous(limits = c(0, 8000))
Scatterplots can be used to look at the relationships of two continuous variables.
plot()
plot(x = penguins$flipper_length_mm,
y = penguins$bill_length_mm)
geom_point()
ggplot(penguins, aes(x = flipper_length_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
theme_minimal()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
geom_jitter()
Jittered points can be used when you seek to make a scatterplot but with a discrete-ish variable…
What is the difference between the below two plots?
A <- ggplot(penguins, aes(x = flipper_length_mm,
y = island)) +
geom_point() + theme_bw() +
xlab("Flipper Length (mm)")
B <- ggplot(penguins, aes(x = flipper_length_mm,
y = island)) +
geom_jitter(height = .1) + theme_bw() +
xlab("Flipper Length (mm)") + ylab("")
# install.packages("patchwork")
library(patchwork)
(A + B )
## Warning: Removed 2 rows containing missing values (`geom_point()`).
## Removed 2 rows containing missing values (`geom_point()`).
Use geom_line() to create a line plot of lifeExpectancy
through time using your tidy-formatted gap dataset from
Challenge 4 above. Before plotting however, use dplyr to pipe your
group_by() and summarize() functions directly
into ggplot.
Whether before or after you visualize, it is usually good to crunch some actual numbers as a companion to whichever visualizations you produce.
Summarize your data in terms of its measures of central tendency and dispersion, skew, kurtosis, etc.
mean(), sd(), and median()
gap <- read.csv("data/raw/gapminder-FiveYearData.csv")
mean(gap$lifeExp)
## [1] 59.47444
sd(gap$lifeExp)
## [1] 12.91711
median(gap$lifeExp)
## [1] 60.7125
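Base R has a few more one-line helpers for dispersion. A sketch on a made-up vector `x`, chosen so the results are easy to check by hand:

```r
# made-up values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)
## [1] 5
median(x)
## [1] 4.5
range(x)    # minimum and maximum
## [1] 2 9
IQR(x)      # interquartile range: 75th minus 25th percentile
## [1] 1.5
var(x)      # sample variance; sd(x) is its square root
summary(x)  # minimum, quartiles, mean, and maximum in one call
```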
Let’s go through the STHDA PCA methods in R practical guide together
Call ?describeBy to see how to produce this function’s default
summary statistics as grouped by continent.
After acquiring, importing, wrangling, and exploring/visualizing data, you may want to formally test parts of your data: differences between groups, relationships among variables, etc. Remember that we only show how the tests work in R here - seeing if your data fits the assumptions of a given test is your responsibility! Schedule an SSDS consultation if you have questions about hypotheses, or anything else covered in this bootcamp!
Thankfully, R easily calculates these statistics for us in many testing scenarios.
t.test()
f_penguins <- penguins %>% filter(sex == "FEMALE")
m_penguins <- penguins %>% filter(sex == "MALE")
t.test(x = f_penguins$body_mass_g,
y = m_penguins$body_mass_g)
##
## Welch Two Sample t-test
##
## data: f_penguins$body_mass_g and m_penguins$body_mass_g
## t = -8.5545, df = 323.9, p-value = 4.794e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -840.5783 -526.2453
## sample estimates:
## mean of x mean of y
## 3862.273 4545.685
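The printed summary is convenient to read, but the test result is also an object whose parts can be pulled out with $ (just like a list). The sketch below uses simulated data rather than the penguins file, so the numbers will differ from the output above; it also shows the formula interface to t.test(), which avoids creating two separate data frames.

```r
set.seed(1)  # make the simulated values reproducible

# simulated stand-in data (not the real penguins measurements)
toy <- data.frame(
  sex = rep(c("FEMALE", "MALE"), each = 20),
  body_mass_g = c(rnorm(20, mean = 3900, sd = 400),
                  rnorm(20, mean = 4500, sd = 400))
)

# formula interface: outcome ~ grouping variable
res <- t.test(body_mass_g ~ sex, data = toy)

res$p.value   # the p-value as a plain number
res$estimate  # the two group means
res$conf.int  # confidence interval for the difference in means
```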
aov()
penguins_aov <- aov(body_mass_g ~ species, data = penguins)
summary(penguins_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 146864214 73432107 343.6 <2e-16 ***
## Residuals 339 72443483 213698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
TukeyHSD(penguins_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = body_mass_g ~ species, data = penguins)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 32.42598 -126.5002 191.3522 0.8806666
## Gentoo-Adelie 1375.35401 1243.1786 1507.5294 0.0000000
## Gentoo-Chinstrap 1342.92802 1178.4810 1507.3750 0.0000000
cor.test()
cor.test(penguins$body_mass_g, penguins$flipper_length_mm)
##
## Pearson's product-moment correlation
##
## data: penguins$body_mass_g and penguins$flipper_length_mm
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.843041 0.894599
## sample estimates:
## cor
## 0.8712018
lm()
penguins_lm <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(penguins_lm)
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.80 -259.27 -26.88 247.33 1288.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394.3 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
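The Estimate column defines the fitted line: predicted body mass = intercept + slope × flipper length. As a worked example using the rounded coefficients printed above, here is the prediction for a hypothetical penguin with 200 mm flippers:

```r
# coefficients copied from the summary output above (rounded)
intercept <- -5780.831
slope <- 49.686

# hypothetical flipper length to predict for
new_flipper_length_mm <- 200

predicted_mass_g <- intercept + slope * new_flipper_length_mm
predicted_mass_g
## [1] 4156.369

# with the fitted model object available, predict() does the same job
# using the unrounded coefficients:
# predict(penguins_lm, newdata = data.frame(flipper_length_mm = 200))
```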
Create a project named targets_test; it will contain the targets_test.Rproj file. Also add an R/functions.R script. Copy/paste the below code:
fetch_data <- function(file) {
read_csv(file, col_types = cols()) %>%
as_tibble()
}
fit_model <- function(data){
lm(flipper_length_mm ~ bill_length_mm, data) %>%
coefficients()
}
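plot_model <- function(model, data){
  ggplot(data) +
    # fit_model() predicts flipper_length_mm from bill_length_mm,
    # so bill length goes on x and flipper length on y to match the line
    geom_point(aes(x = bill_length_mm,
                   y = flipper_length_mm)) +
    geom_abline(intercept = model[1], slope = model[2]) +
    theme_bw()
}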
Run the following in the targets_test console to load the requisite libraries and source the functions.
# load libraries
library(dplyr); library(ggplot2); library(targets); library(tibble); library(readr)
# source functions to 1) fetch data, 2) fit a linear model, and 3) plot the results
source("R/functions.R")
# define the moving parts
file <- "penguins.csv"
data <- fetch_data(file)
model <- fit_model(data)
figure <- plot_model(model, data)
# create the targets file
targets::use_targets()
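The _targets.R file that use_targets() creates is a template to be filled in. As a rough sketch only, modeled on the walkthrough in the targets package manual and assuming the three functions above live in R/functions.R and penguins.csv sits in the project folder, a filled-in version might look like this:

```r
# _targets.R (sketch; target names mirror the objects defined above)
library(targets)

# make the custom functions available to the pipeline
source("R/functions.R")

# packages that the targets need when they run
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "tibble"))

# the pipeline: each target is a named step that depends on earlier ones
list(
  tar_target(file, "penguins.csv", format = "file"),  # track the input file
  tar_target(data, fetch_data(file)),
  tar_target(model, fit_model(data)),
  tar_target(figure, plot_model(model, data))
)
```

Because each target names the objects it depends on, tar_make() can skip steps whose inputs have not changed since the last run.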
Edit the _targets.R file together. Run
tar_make(). What happened? (hint: the output is inside of
the “_targets/objects” folder!)
The bootcamp materials were written in R Markdown, which allows us to convert our code and text to various document formats such as HTML, MS Word, PDF, and ioslides. Remember:
* Text is entered normally
* Code should be entered in chunks
# hashtags work normally inside of code chunks
save() and load()
Save only the variables and functions you want in an “.RData” file
with the save() function like so:
save(penguins, gap_long, file = "data/preprocessed/2023May12_penguins_gap_long.RData")
Now, wipe your global environment clean! Load only the saved
variables with load()
load("data/preprocessed/2023May12_penguins_gap_long.RData")
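To try the round trip without touching your project folders, the sketch below saves to a temporary file instead. It also shows saveRDS() and readRDS(), a base R alternative that stores a single object and lets you choose its name when reading it back. The `fruit_counts` vector is made up.

```r
# made-up object to save
fruit_counts <- c(3, 5, 2)

# save() / load(): the object comes back under its original name
rdata_path <- tempfile(fileext = ".RData")
save(fruit_counts, file = rdata_path)
rm(fruit_counts)   # remove it from the environment
load(rdata_path)   # ... and bring it back
fruit_counts
## [1] 3 5 2

# saveRDS() / readRDS(): one object per file, assigned to any name you like
rds_path <- tempfile(fileext = ".rds")
saveRDS(fruit_counts, rds_path)
fruit_counts_copy <- readRDS(rds_path)
identical(fruit_counts, fruit_counts_copy)
## [1] TRUE
```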